Intro

The TSstudio provides a set of tools for time series analysis and forecasting application. The following document demonstrates the usage of the package to forecast the monthly consumption of natural gas in the US in the next 5 years (or 60 months)

Installation

Install from CRAN:

install.packages("TSstudio")

Or from Github:

devtools::install_github("RamiKrispin/TSstudio")

As for December 2018, the most updated version is 0.1.3

library(TSstudio)

Data

The USgas dataset, one of the package datasets, represents the monthly consumption (in Billion Cubic Feet) of natural gas in the US since January 2000 [1]:

# Load the series
data("USgas")

# Get the series info
ts_info(USgas)
##  The USgas series is a ts object with 1 variable and 227 observations
##  Frequency: 12 
##  Start time: 2000 1 
##  End time: 2018 11
# Plot the series
ts_plot(USgas,
        title = "US Monthly Natural Gas Consumption",
        Ytitle = "Billion Cubic Feet",
        Xtitle = "Source: Federal Reserve Bank of St. Louis", 
        slider = TRUE)

Exploratory analysis

a

Decomposing the series

ts_decompose(USgas)

Seasonal analysis

You can note from the plot above that the series has strong seasonality and fairly stable trend (or growth). We will use the ts_seasonal and the ts_heatmap the explore further the seasonality and trend of the series:

ts_seasonal(USgas, type = "all")

The ts_seasonal function provides three different views of the seasonality of the series (when the type argument is equal to "all"):

  • The upper plot splits and alinged the series by the cycle units of the series, which in this case the cycle length of the series is one year and the cycle units are the months of the year (or the series frequency). You can not from that plot that:
    • The consumption has high depandency to the month of the year (e.g., high consumption during the winter and low throughout the summer)
    • The color scale of the plot, which gradualy change from red for early years to blue for most recent years. Indicates that there is a gradual growth in the consumption from year to year
  • The middle plot represents the change from year to year of each frequency units. Here we can notice that, in most of the years, the order of the consumption magnitute remains the same (e.g., in most of the years January has the highest consumption and May and September the lowest)
  • The last plot, provides a good overview of the consumption variation by frequency units (or month of the year). You can notice that the variation of the consumption during the summer time is lower (or less seneative), as opposed to the ones throughout the winter months

The combination of the three plots provides the full picture on the series seaosnality. Alternativliy, you can use the ts_heatmap to get a time series heatmap represenative of the series:

ts_heatmap(USgas)
ts_surface(USgas)

Correlation analysis

The next step is to identify the level of correlation between the series and it’s lags, using the ts_acf function, which is nothing but an interactive version of the acf function:

ts_acf(USgas, lag.max = 36)

As expected you can notice that the series is highely correlated with it’s seasonal lags. A more intuative way to review identify a deoandency between the series and its lags is with the ts_lags function, which provides plots of the series against its lags:

ts_lags(USgas)

As observed before with the ts_acf function, you can notice that the seasonal lag (or lag 12) of the series has a strong linear relationship with the series. In a semilar way we can zoom in on the seasonal lags of the series using the lags argument:

ts_lags(USgas, lags = c(12, 24, 36, 48))

Forecasting the series

Using the information we learned from the seasonal and correlation analysis, we will conduct “horse race” between several models and select the one that performe best on the testing set, by using two training approaches:

  • Traditional approach - by splitting the series into training and testing (sample out) partitions. Train each model on the training set and evluate its perfromance on the testing set. We will use the following four models - auto.arima, ets, nnetar and tslm from the forecast package

  • Backtesting - by using expending window to train and test each model on multiple training and testing sets. We will utilize the ts_backtesting function to train multiple models from the forecast, forecastHybrid, and bsts packages.

Traditional Approach

# Set the sample out and forecast horizon
h1 <- 12 # sample out lenght
h2 <- 60 # forecast horizon

USgas_split <- ts_split(USgas, sample.out = h1)
train <- USgas_split$train
test <- USgas_split$test

ts_info(train)
##  The train series is a ts object with 1 variable and 215 observations
##  Frequency: 12 
##  Start time: 2000 1 
##  End time: 2017 11
ts_info(test)
##  The test series is a ts object with 1 variable and 12 observations
##  Frequency: 12 
##  Start time: 2017 12 
##  End time: 2018 11
set.seed(1234)

library(forecast)

# auto.arima
md1 <- auto.arima(train, 
                  stepwise = FALSE, 
                  approximation = FALSE,
                  D = 1)
fc1 <- forecast(md1, h = h1)
accuracy(fc1, test)
##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set  -1.586018  98.37274  72.07709 -0.2748132 3.497391 0.6718145
## Test set     198.515290 216.81325 200.60105  7.9783005 8.055568 1.8697577
##                     ACF1 Theil's U
## Training set  0.02180888        NA
## Test set     -0.02214037  0.815383
# ETS
md2 <- ets(train, opt.crit = "mse")
fc2 <- forecast(md2, h = h1)
accuracy(fc2, test)
##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set   4.530822  99.61901  73.11756 0.07081195 3.561811 0.6815125
## Test set     114.734134 173.88352 155.67014 4.94191827 6.458404 1.4509667
##                    ACF1 Theil's U
## Training set 0.06787102        NA
## Test set     0.41614139 0.6901814
# Neural network time series forecasts
md3 <- nnetar(train, 
              P = 2, # using 2 seasonal lags
              p = 1, # and 1 non-seasonal lags
              repeats = 100)
fc3 <- forecast(md3, h = h1)
accuracy(fc3, test)
##                        ME     RMSE       MAE        MPE     MAPE      MASE
## Training set   0.01423023 119.9402  94.28152 -0.3203614 4.587159 0.8787771
## Test set     212.30940334 237.3941 212.30940  8.1450374 8.145037 1.9788886
##                   ACF1 Theil's U
## Training set 0.3865153        NA
## Test set     0.4343893 0.8083805
# Time series linear regression
md4 <- tslm(train ~ season + trend + I(trend ^ 2))
fc4 <- forecast(md4, h = h1)
accuracy(fc4, test)
##                                   ME     RMSE       MAE        MPE
## Training set -0.00000000000001585288 105.1812  76.25964 -0.2349141
## Test set     88.96024173258881262427 134.7567 118.67616  3.4109552
##                  MAPE      MASE        ACF1 Theil's U
## Training set 3.706897 0.7107991  0.47320514        NA
## Test set     4.559134 1.1061541 -0.03942927 0.4659541

We will utlize the test_forecast function to visualize the goodness of fit of each model on both the training and testing partitions:

test_forecast(forecast.obj = fc1, actual = USgas, test = test)
test_forecast(forecast.obj = fc2, actual = USgas, test = test)
test_forecast(forecast.obj = fc3, actual = USgas, test = test)
test_forecast(forecast.obj = fc4, actual = USgas, test = test)